Building a Large-Scale Annotated Chinese Corpus
Abstract
In this paper we address issues related to building a large-scale Chinese corpus. We try to answer four questions: (i) how to speed up annotation, (ii) how to maintain high annotation quality, (iii) for what purposes the corpus is applicable, and (iv) what future work we anticipate.
Similar resources
Building an Annotated Japanese-Chinese Parallel Corpus - A Part of NICT Multilingual Corpora
We are constructing a Japanese-Chinese parallel corpus, which is a part of the NICT Multilingual Corpora. The corpus is general domain, of large scale of about 40,000 sentence pairs, long sentences, annotated with detailed information and high quality. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting ...
How Should A Large Corpus Be Built? - A Comparative Study Of Closure In Annotated Newspaper Corpora From Two Chinese Sources, Towards Building A Larger Representative Corpus Merged From Representative Sublanguage Collections
This study measures comparative lexical and syntactic closure rates in annotated Chinese newspaper corpora from the Academia Sinica Balanced Corpus and the University of Pennsylvania's Chinese Treebank. It then draws inferences as to how large such corpora need be to be representative models of subject-matter-constrained language domains within the same genre. Future large corpora should be bui...
Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains
A corpus is a valuable resource for information retrieval and data-driven natural language processing systems, especially for spoken dialogue research in specific domains. However, there are few non-English corpora, particularly in Chinese. Spoken by the nation with the largest population in the world, Chinese has become increasingly prevalent and popular among millions of people worldwide. ...
Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with Respect to Dependency Measures
Parsing is one of the important processes for natural language processing and, in general, a large-scale CFG is used to parse a wide variety of sentences. For many languages, a CFG is derived from a large-scale syntactically annotated corpus, and many parsing algorithms using CFGs have been proposed. However, we could not apply them to Japanese since a Japanese syntactically annotated corpus ha...
Publication date: 2002